Abstract

In this paper, we propose a skeleton-matching-based approach which aids in text localization in scene
images. The input image is preprocessed and segmented into blocks using connected component
analysis. We obtain the skeleton of each segmented block using a morphology-based approach. The
skeletonized images are compared with the trained templates in the database to categorize them into text
and non-text blocks. Further, newly designed geometrical rules and morphological operations are
employed on the detected text blocks for scene text localization. The experimental results obtained
on publicly available standard datasets illustrate that the proposed method can detect and localize
texts of various sizes, fonts and colors.
1. Introduction


Text localization in document/scene images and video frames is aimed at building advanced
optical character recognition (OCR) systems. However, the large variations in text
fonts, colors, styles, and sizes, as well as the low contrast between the text and the complicated
background, often make text detection extremely challenging. Experimental results reported by
researchers on such complex text images and videos reveal that applying conventional OCR
technology leads to poor recognition rates. Therefore, efficient detection and segmentation of
text blocks from the background is necessary to fill the gap between image/video documents
and the input of a standard OCR system.

Text-based search technology acts as a key component in the development of advanced
image/video annotation and retrieval systems. In general, the methods for detecting
text can be broadly categorized into five classes based on the features associated with the
text: connected component analysis (CCA), edge-based, corner-based, texture-based and
stroke-based approaches. The connected component analysis approaches [?], [?] assume that
the pixels in text regions have homogeneous color, intensity and texture. These color-based
methods are simple and are suitable only for simple backgrounds.
The edge-based approaches [3], [?] require text to have a reasonably high contrast to the
background in order to detect the edges. These methods often encounter problems with complex
backgrounds and produce many false positives. The corner-based methods [?], [?] extract
corner features to detect the text in images. The corner-based methods are more effective, but
detecting the corners is generally a time-consuming task. The texture-based methods [?], [10]
assume text regions to have some kind of special texture. The texture-based methods are
time-consuming and are sometimes influenced by the fonts and styles of characters. The
stroke-based method [?] captures the intrinsic characteristics of text strokes, so that better
detection results are obtained even with complex backgrounds.

Although text recognition in documents is satisfactorily addressed by state-of-the-art
OCR systems, and text localization and recognition in images of real-world scenes has
received significant attention in the last decade, scene text localization and recognition
is still an open problem. In this context, we propose a new approach for text localization
in scene images. Our method first preprocesses the input image; the preprocessed image then
undergoes segmentation, skeletonization, template matching, classification and localization.
The remaining part of the paper is organized as follows. The proposed approach is discussed
in Section 2. Experimental results and comparison with other approaches are presented in
Section 3, and the conclusion is given in Section 4.

2. Proposed Methodology 

The flowchart of the proposed text detection and localization approach is shown in Fig. 1.
The details of each processing block are discussed below.



Figure 1. Flowchart of the proposed system 


2.1. Preprocessing & Segmentation 

In this phase, the input image is segmented into small blocks using connected component
analysis. The given input image is first converted into a gray image, and median filtering is
employed to remove the noise in the resultant image. We employ an efficient binarization
approach over the filtered image to get a binarized image. If the background in the image is
relatively uniform, then a global threshold value can be used to binarize the image by pixel intensity.
If there is large variation in the background intensity, then adaptive thresholding (local or
dynamic thresholding) may produce better results. The conventional thresholding operator uses
a global threshold for all pixels, whereas adaptive thresholding changes the threshold
dynamically over the image and thereby takes into account spatial variations in illumination.
The filtered image is therefore binarized using local adaptive thresholding, which selects an
individual threshold for each pixel based on the range of intensity values in its local
neighborhood, as shown in Fig. 2.
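The filtering and local binarization steps can be sketched as follows. This is a minimal, dependency-free illustration rather than the authors' implementation: the midrange rule (threshold at the midpoint of the local minimum and maximum), the window radii, the `min_range` contrast guard and the dark-text-on-light-background polarity are all assumptions, since the text only states that the threshold depends on the local intensity range.

```python
from statistics import median


def neighborhood(img, r, c, radius):
    """Values in the (2*radius+1)^2 window around (r, c), clipped at the image border."""
    rows, cols = len(img), len(img[0])
    return [img[i][j]
            for i in range(max(0, r - radius), min(rows, r + radius + 1))
            for j in range(max(0, c - radius), min(cols, c + radius + 1))]


def median_filter(img, radius=1):
    """Replace every pixel by the median of its neighborhood to suppress impulse noise."""
    return [[median(neighborhood(img, r, c, radius)) for c in range(len(img[0]))]
            for r in range(len(img))]


def adaptive_threshold(img, radius=2, min_range=15):
    """Binarize a gray image: 1 = dark foreground, 0 = background (assumed polarity)."""
    out = []
    for r in range(len(img)):
        row = []
        for c in range(len(img[0])):
            window = neighborhood(img, r, c, radius)
            lo, hi = min(window), max(window)
            if hi - lo < min_range:
                # Flat neighborhood: no local contrast, so treat the pixel as background.
                row.append(0)
            else:
                # Assumed rule: threshold at the midpoint of the local intensity range.
                row.append(1 if img[r][c] < (lo + hi) / 2 else 0)
        out.append(row)
    return out
```

Because the threshold follows the local range, a dark stroke on a bright part of the background and an equally contrasted stroke on a darker part can both be recovered, which a single global threshold cannot guarantee.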


[Panels: Original Image, Gray Image, Filtered image, Image after adaptive thresholding]

Figure 2. Results of Binarization

For segmentation, we perform connected component analysis, a technique that scans a binarized
image and labels its pixels into components based on pixel connectivity. Connected component
labeling thus detects the connected regions in the binary image, and the binarized image is
segmented into blocks accordingly.
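A minimal sketch of connected component labeling by breadth-first flood fill is given below. The function name, the returned dictionary layout and the default of 8-connectivity are illustrative assumptions; the text does not state which connectivity is used at this stage.

```python
from collections import deque


def connected_components(binary, connectivity=8):
    """Label foreground pixels (value 1) and return one block per connected region.

    Each block is a dict with its pixel list and bounding box
    (min_row, min_col, max_row, max_col).
    """
    rows, cols = len(binary), len(binary[0])
    if connectivity == 4:
        steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    else:
        steps = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]

    seen = [[False] * cols for _ in range(rows)]
    blocks = []
    for r in range(rows):
        for c in range(cols):
            if not binary[r][c] or seen[r][c]:
                continue
            # Start a new component and grow it with a breadth-first search.
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy, dx in steps:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and binary[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            ys = [p[0] for p in pixels]
            xs = [p[1] for p in pixels]
            blocks.append({"pixels": pixels, "bbox": (min(ys), min(xs), max(ys), max(xs))})
    return blocks
```

Each bounding box can then be used to cut the corresponding block out of the binarized image for the following stages.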

2.2. Skeletonization 

Skeletonization [?], [?] is a morphological operation that is used to remove selected
foreground pixels from binary images. It is normally applied only to a binary image, and it
produces another binary image as its output. The skeleton of the character image is important
for its detection; redundant information can therefore be discarded using skeletonization [?].
We now find the skeleton of each segmented block, which yields an image of single-pixel width
that maintains the basic shape of the block. The resultant images are resized to form templates
of fixed size. These unknown templates are then compared with the known templates in the
template database.
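The text does not say which thinning algorithm is used. As an illustration only, the sketch below applies Zhang-Suen thinning, a widely used iterative morphological skeletonization scheme, to a 0/1 image; the function name is hypothetical.

```python
def zhang_suen_thinning(binary):
    """Thin a 0/1 image to a skeleton with Zhang-Suen thinning (assumed algorithm)."""
    img = [row[:] for row in binary]  # work on a copy, leave the input untouched
    rows, cols = len(img), len(img[0])

    def px(r, c):
        # Pixels outside the image count as background.
        return img[r][c] if 0 <= r < rows and 0 <= c < cols else 0

    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_clear = []
            for r in range(rows):
                for c in range(cols):
                    if not img[r][c]:
                        continue
                    # Neighbors P2..P9, clockwise starting from north.
                    p = [px(r - 1, c), px(r - 1, c + 1), px(r, c + 1), px(r + 1, c + 1),
                         px(r + 1, c), px(r + 1, c - 1), px(r, c - 1), px(r - 1, c - 1)]
                    b = sum(p)  # number of foreground neighbors
                    # Number of 0 -> 1 transitions around the neighborhood.
                    a = sum(p[i] == 0 and p[(i + 1) % 8] == 1 for i in range(8))
                    n, e, s, w = p[0], p[2], p[4], p[6]
                    if step == 0:
                        ok = n * e * s == 0 and e * s * w == 0
                    else:
                        ok = n * e * w == 0 and n * s * w == 0
                    if 2 <= b <= 6 and a == 1 and ok:
                        to_clear.append((r, c))
            # Delete in parallel, after the whole image has been examined.
            for r, c in to_clear:
                img[r][c] = 0
            changed = changed or bool(to_clear)
    return img
```

For a solid bar three pixels thick, the procedure leaves a one-pixel-wide line along the middle row, which is the kind of shape-preserving reduction the section describes.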

2.3. Template Matching 

Template matching is a natural approach to pattern classification [13]. It involves
determining similarities between a given template and windows of the same size in an image and
identifying the window that produces the highest similarity measure. It works by comparing
derived image features of the image and the template for each possible displacement of
the template. This process involves the use of a database of characters or templates. We
have created a template for each possible input character. The skeleton of every character in
the dataset is cropped to fit into a window without surrounding white space, and the template
is created from it.
Figure 3. The skeletonized database of subset of characters

The templates are normalized to 42 x 24 pixels and stored in the database. The skeletonized
database of a subset of such characters is shown in Fig. 3.
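Template normalization can be sketched as a tight crop followed by a resize. The helper names are hypothetical, nearest-neighbor sampling is an assumed interpolation choice, and reading "42 x 24" as 42 rows by 24 columns is also an assumption, since the text does not give the orientation.

```python
def crop_to_content(binary):
    """Crop a 0/1 image to the tight bounding box of its foreground pixels."""
    rows_on = [r for r, row in enumerate(binary) if any(row)]
    cols_on = [c for c in range(len(binary[0])) if any(row[c] for row in binary)]
    if not rows_on:
        return [row[:] for row in binary]  # nothing to crop in an empty image
    return [row[cols_on[0]:cols_on[-1] + 1]
            for row in binary[rows_on[0]:rows_on[-1] + 1]]


def resize_nearest(binary, out_rows=42, out_cols=24):
    """Resize by nearest-neighbor sampling (assumed; keeps the image strictly 0/1)."""
    in_rows, in_cols = len(binary), len(binary[0])
    return [[binary[r * in_rows // out_rows][c * in_cols // out_cols]
             for c in range(out_cols)]
            for r in range(out_rows)]


def make_template(skeleton):
    """Crop away the surrounding white space and normalize to the template size."""
    return resize_nearest(crop_to_content(skeleton))
```

Nearest-neighbor sampling can thicken a one-pixel skeleton when enlarging and break it when shrinking, so a practical system might thin the block again after resizing.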

2.4. Matching Strategy 

In the proposed method, text classification is done using template matching based on 2-D
correlation coefficients between the characters, as shown in Fig. 4. During the process of
classification, the current input character is compared to each template to find either an exact
match or the template with the closest representation of the input character. If I(x, y) is the
input character and Tn(x, y) is the template n, then the matching function s(I, Tn) returns a
value indicating how well template n matches the input character. Normalization is done
through mask processing; the resulting mapping takes every pixel of the original image to the
corresponding pixel in the normalized image. After normalization, the skeletonized character
of the input test image is matched against all the skeletonized characters in the template
database using the 2-D normalized correlation coefficient to identify similar patterns between
the test image and the standard database of skeletonized images, i.e.,

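\[
s(I, T_n) = \frac{\sum_{i}\sum_{j} \bigl(I(i,j) - \bar{I}\bigr)\bigl(T_n(i,j) - \bar{T}_n\bigr)}
                 {\sqrt{\sum_{i}\sum_{j} \bigl(I(i,j) - \bar{I}\bigr)^{2} \; \sum_{i}\sum_{j} \bigl(T_n(i,j) - \bar{T}_n\bigr)^{2}}}
\qquad (1)
\]

where the sums run over all pixel positions (i, j) of the normalized window, and \bar{I} and
\bar{T}_n denote the mean values of the input character and of template n, respectively.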

This method is efficient and fast when dealing with character identification. For
templates without strong features, or when the bulk of the template image constitutes the
matching image, a template-based approach may be effective. Since template-based matching
may potentially require sampling of a large number of points, it is possible to reduce the
number of sampling points by reducing the resolution of the search and template images by the
same factor. The resized images of skeletonized characters (letters A-Z and a-z, digits 0-9)
are used to form the character templates. These character templates are in the form
of feature vectors, which are stored as the reference data pattern. The reference data pattern
is used at template-matching time to find the appropriate character. Once template matching
is performed, the template having the highest similarity is said to be highly correlated and
hence is considered the best matched template. This template is further used for
classifying whether the segmented block is a text block or a non-text block, as illustrated in Fig. 5.

Character recognition is based on the previously constructed database, which contains the
important features of the characters that are already known. It should be observed here
that the database can be made self-expandable to accommodate new font styles, new alphabet
sets, etc., to increase the localization accuracy. The skeletons of the characters in various font
styles are stored in the knowledge base, which is used for matching. In the classification phase,
the learned prototypes are used to assign the unknown incoming patterns to the class of the
matching prototype. In this phase, we extract the features of the segmented block and compare
them with those recorded in the database. If the features match completely or closely,
then the segmented input block is classified into a known class (text blocks).
Otherwise, the segmented block belongs to the class of non-text blocks, as depicted in Fig. 5.
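The matching and classification steps can be sketched as follows. The score is the standard 2-D normalized correlation coefficient; the function names and the acceptance threshold `min_score` are hypothetical, since the text does not quantify what counts as a "closely matched" block.

```python
from math import sqrt


def corr2(a, b):
    """2-D normalized correlation coefficient between two equally sized images."""
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("images must have the same size")
    mean_a = sum(flat_a) / len(flat_a)
    mean_b = sum(flat_b) / len(flat_b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(flat_a, flat_b))
    den = sqrt(sum((x - mean_a) ** 2 for x in flat_a)
               * sum((y - mean_b) ** 2 for y in flat_b))
    return num / den if den else 0.0  # a constant image carries no pattern


def classify_block(block, templates, min_score=0.5):
    """Return (class, best_label, best_score) for a normalized skeleton block.

    `templates` maps a character label to its template image; `min_score`
    is an assumed cut-off below which the block is called non-text.
    """
    best_label, best_score = None, -1.0
    for label, template in templates.items():
        score = corr2(block, template)
        if score > best_score:
            best_label, best_score = label, score
    kind = "text" if best_score >= min_score else "non-text"
    return kind, best_label, best_score
```

A block identical to a stored template scores 1 and is accepted, while a block that correlates weakly or negatively with every template falls below the cut-off and is rejected as non-text.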

2.5. Text Localization 

The objective of text localization is to place rectangles of varying sizes covering the text
regions [?]. We start by merging all the text detections. We employ geometrical analysis to
identify the text components and group them to localize text regions. The false positives are
eliminated by computing the height, width and aspect ratio (AR) and using some geometrical rules
devised based on the edge area (EA) of the text blocks.
Figure 5. Text classification of the segmented blocks using template matching


AR = width / height
density = EA / (height * width)

According to the attributes of a horizontal text line, we use the following rules to confirm
the non-text blocks.


i) AR < T1 || density < T2

ii) height > 50 || height < 6

iii) width < 5 || height * width < 24

The candidate text lines are obtained by applying these rules. The threshold values T1 and
T2 are the calculated mean and standard deviation, respectively. Then, we label the connected
components by using 4-connectivity. The foreground connected components for each of these
images are considered as text candidates. Further, the morphological dilation operation is
performed to fill the gaps inside the obtained text regions, which yields better results, and the
boundaries of the text regions are identified.
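The rule-based filter above can be sketched as a single predicate. Treating a block as non-text when any one of the three rules fires is an assumption, and T1 and T2 are passed in as parameters because the text does not state over which quantity the mean and standard deviation are computed. The function name is hypothetical.

```python
def is_non_text(height, width, edge_area, t1, t2):
    """Return True if a candidate block is rejected by the geometrical rules.

    height, width : bounding-box size of the block in pixels
    edge_area     : edge area (EA) of the block
    t1, t2        : thresholds T1 and T2, supplied by the caller
    """
    aspect_ratio = width / height
    density = edge_area / (height * width)
    return (aspect_ratio < t1 or density < t2      # rule (i)
            or height > 50 or height < 6           # rule (ii)
            or width < 5 or height * width < 24)   # rule (iii)


def keep_text_blocks(blocks, t1, t2):
    """Filter (height, width, edge_area) tuples, keeping only candidate text blocks."""
    return [b for b in blocks if not is_non_text(b[0], b[1], b[2], t1, t2)]
```

For example, with illustrative thresholds `t1=0.5` and `t2=0.1`, a 20 x 40 block with an edge area of 200 passes all three rules, whereas a block taller than 50 pixels is discarded by rule (ii).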

3. Experimental Results 

This section presents the experimental results to demonstrate the effectiveness of the proposed
approach. The evaluation of the system on the ICDAR database shows that it is capable of
detecting and locating texts of different sizes, styles and types present in natural scenes.
The performance of the proposed approach is evaluated for scene images with respect to the
f-measure (F), which is a combination of two metrics, precision (P) and recall (R), and the
results are reported in Table 1. We have conducted experiments on the ICDAR 2003 text locating
competition dataset [?], and the localized text blocks are shown in Fig. 6. In order to exhibit
the performance of the proposed approach, we have made a comparative study with the
state-of-the-art text localization approaches [1], [12].

[Panels: Original Image, Detected Text, Detected text region, Localized Text]

Figure 6. Sample results of image localization


Table 1. Evaluation performance for ICDAR 2003 dataset

Methods               R      P      F
Pan et al. [12]       0.67   0.71   0.69
Epshtein et al. [1]   0.73   0.60   0.66
Proposed              0.83   0.79   0.81
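The text does not spell out how precision and recall are combined. A minimal sketch, assuming the usual equally weighted harmonic mean, is:

```python
def f_measure(precision, recall):
    """Equally weighted harmonic mean of precision and recall (assumed definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# The reported values for the proposed method are consistent with this definition.
print(round(f_measure(0.79, 0.83), 2))  # ICDAR 2003: P = 0.79, R = 0.83
print(round(f_measure(0.81, 0.85), 2))  # ICDAR 2011: P = 0.81, R = 0.85
```

Under this assumption, the two calls reproduce the F values of 0.81 and 0.83 reported for the proposed method on ICDAR 2003 and ICDAR 2011, respectively.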

Pan et al. [12] used a Conditional Random Field (CRF) model and an energy minimization
approach to detect and localize the text. However, this method was meaningful for unconstrained
scene text localization. Epshtein et al. [1] employed the stroke width transform method, the Canny
edge detector and connected component analysis for text localization. The limitation of the
method is its dependency on successful edge detection, which is likely to fail on blurred or
low-contrast images.
Table 2. Evaluation performance for ICDAR 2011 dataset

Methods               R      P      F
Yi et al. [9]         0.71   0.67   0.62
Neumann et al. [11]   0.65   0.63   0.69
Proposed              0.85   0.81   0.83



Figure 7. Sample results of localization as found in literature 


The proposed method is also evaluated using the same parameters on the ICDAR 2011 dataset,
and the results are reported in Table 2. Yi et al. [9] employed a text region detector, a
conditional random field model and a learning-based energy minimization approach to detect and
localize the text in natural scene images. This approach obtained a better recall rate, but it was
time-consuming. Neumann et al. [11] achieved good performance for character detection by
using an efficient sequential selection method for characters from the set of Extremal Regions
(ERs). The method fails in the presence of noise and low character contrast, as demonstrated
by the false positives that arise from watermarked text.

The proposed approach has gained slightly improved precision and recall rates on the
ICDAR 2011 dataset and comparable results for the ICDAR 2003 dataset when compared to other
approaches in the literature. For visualization purposes, sample text localization results of
other methods [?] are shown in Fig. 7. The sample text line localization results of our
proposed method are shown in Fig. 8. The evaluation performance of the proposed method on the
ICDAR datasets when text lines are localized is highlighted in Table 3.

[Panels: Word Localization, Line Localization]

Figure 8. Sample text localization results of the proposed approach when bounding boxes are placed
word-wise and line-wise.


Table 3. Evaluation performance of text line localization for the proposed method

Dataset        R      P      F
ICDAR 2003     0.86   0.83   0.80
ICDAR 2011     0.84   0.79   0.82


4. Conclusions 

Text embedded in scene images contains abundant high-level semantic information, which
is important for analysis, indexing and retrieval. We developed a skeleton-matching-based
approach that classifies the text blocks through template matching. The newly developed
approach is capable of localizing the text regions in scene images. Experimental results show
that the proposed method is effective for identifying text and non-text blocks. The various
problems that occur due to complex backgrounds need to be addressed in our future work.